Biological Pattern Discovery with R Machine Learning Approaches (Zheng Rong Yang)

The cluster centres of a model are normally unknown at the start of a learning process for a data set. Data points therefore cannot be assigned to clusters, because the assignment requires known cluster centres. A basic strategy used in the machine learning community thus comes to the fore: the so-called trial and error learning strategy. Its basic principle is a repeated process of estimating model parameters. In the trial stage, a set of model parameters is either assigned a set of random values or updated in the direction in which a pre-defined error function is decreased or minimised. In the error stage, a new error is estimated based on the fitness performance of the model with the current parameters. The new error thus determines a new parameter update direction.
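The two stages can be illustrated with a minimal sketch in R. It is not taken from the text: the toy data, the single parameter `theta`, the squared-error function `error_fn` and the step size `step` are all assumptions chosen only to make the trial and error cycle visible.

```r
# Hypothetical example: estimate one parameter (theta) by trial and error.
set.seed(1)
x <- rnorm(50, mean = 3)                 # toy data (assumption)

# Pre-defined error function: sum of squared differences (assumption)
error_fn <- function(theta) sum((x - theta)^2)

theta  <- runif(1, min(x), max(x))       # trial stage: random initial guess
step   <- 0.002                          # hypothetical step size
errors <- error_fn(theta)                # error stage for the initial guess

for (iter in 1:200) {
  grad   <- -2 * sum(x - theta)          # direction in which the error grows
  theta  <- theta - step * grad          # trial stage: move the opposite way
  errors <- c(errors, error_fn(theta))   # error stage: re-evaluate the error
}
```

Under these assumptions the recorded `errors` never increase, and `theta` settles at the value that minimises `error_fn`, which for this error function is `mean(x)`.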

The cluster centres of a model, as the model parameters, are assigned random values as guesses at the beginning, although these guesses are almost certainly inaccurate. The guessed cluster centres are then examined to see whether they are accurate. If not, they are updated in a direction in which a pre-defined error function can be reduced. Gradually, the learning error diminishes, or the cluster centres of the model stop changing once the optimal cluster structure has been found.
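The guess, examine and update cycle described above can be sketched in R as follows. This is an illustrative sketch rather than the book's own code: the toy data `X`, the helper `sq_dist`, the choice of random data points as initial guesses, and the use of the summed squared distance to the nearest centre as the pre-defined error function are all assumptions.

```r
set.seed(42)
# Toy data (assumption): two groups of 20 two-dimensional points each
X <- rbind(matrix(rnorm(40, mean = 0, sd = 0.5), ncol = 2),
           matrix(rnorm(40, mean = 4, sd = 0.5), ncol = 2))
K <- 2

# Hypothetical helper: squared Euclidean distance from every point (rows)
# to every centre (columns)
sq_dist <- function(X, centres) {
  sapply(seq_len(nrow(centres)),
         function(k) rowSums(sweep(X, 2, centres[k, ])^2))
}

# Initial guess: K randomly chosen data points serve as the centres
centres <- X[sample(nrow(X), K), , drop = FALSE]
errors  <- numeric(0)

for (iter in 1:100) {
  d       <- sq_dist(X, centres)
  cluster <- apply(d, 1, which.min)          # nearest centre for each point
  # Error stage: sum of squared distances to the assigned centres
  errors  <- c(errors, sum(d[cbind(seq_len(nrow(X)), cluster)]))

  # Trial stage: move each centre to the mean of its assigned points
  new_centres <- centres
  for (k in seq_len(K)) {
    if (any(cluster == k)) {
      new_centres[k, ] <- colMeans(X[cluster == k, , drop = FALSE])
    }
  }
  if (all(abs(new_centres - centres) < 1e-8)) break   # centres stopped changing
  centres <- new_centres
}
```

In this sketch the loop ends as soon as the centres stop moving, and the values stored in `errors` do not increase from one pass to the next.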

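The membership by which the $n$-th data point ($\mathbf{x}_n$) of a data set $\mathcal{D}$ belongs to the $k$-th cluster is denoted by $f_{nk}$, and the K-means membership function is defined as below,

$$
f_{nk} =
\begin{cases}
1 & \text{if } \mathbf{x}_n \text{ belongs to cluster } k \\
0 & \text{otherwise}
\end{cases}
\tag{2.20}
$$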
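As a small illustration of the membership function (2.20), the sketch below builds the full membership matrix in R. The label vector `cluster` and the value of `K` are made-up inputs, not data from the text.

```r
# Hypothetical inputs: cluster label of each of N = 5 data points, K = 3
cluster <- c(1, 3, 2, 1, 3)
K <- 3

# f[n, k] is 1 when point n belongs to cluster k and 0 otherwise
f <- outer(cluster, seq_len(K), FUN = "==") * 1

# Each row holds a single 1, so every point belongs to exactly one cluster
rowSums(f)
```

Storing the memberships as a zero/one matrix is one possible design choice; a plain label vector such as `cluster` carries the same information more compactly.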

In the K-means algorithm, the error is defined as the sum of the squared distances between the data points and the cluster centres to which they belong. Suppose there are $K$ clusters; each data point then has $K$ distances, each being the distance between that data point and one cluster centre. Only the shortest distance is counted in the total error, i.e., the objective function. The total error or the objective function of a K-means model is then defined as below, where $\mathbf{u}_k$ is the centre of the $k$-th cluster and $f_{nk}$ is the membership by which the $n$-th data point ($\mathbf{x}_n$) belongs to the $k$-th cluster, which is either a zero or a one,